The term “selection bias” can refer to different things depending on the discipline.
Economists’ selection bias (selection on observables) is actually epidemiologists’ confounding bias.
Survey statisticians use the term selection bias for non-representative selection of a sample from the population, which can lead to biased conclusions in descriptive research.
Epidemiologists excelled in conceptualizing and making the distinction between confounding and selection.
It’s a good habit to make sure that your collaborators are on the same page regarding terminology to avoid confusion.
This type of selection bias cannot happen under the null.
This type of selection bias can happen even without conditioning on colliders.
\(A\) : Treatment
\(Y\) : Fetal malformation
\(C\) : Live birth
We don’t have \(Y\) for fetuses that died before birth, so we are essentially restricting our analysis to live births (the stratum written as \(C = 0\), i.e. uncensored, in the ratio below).
\[ \frac{Pr[Y = 1|A = 1, C = 0]}{ Pr[Y = 1|A = 0, C = 0]} \]
Is this a valid estimate for our estimand?
\[\frac{Pr[Y^{ a=1} = 1]}{Pr[Y^{ a=0} = 1]} \]
The answer is no, because restricting to one level of the collider \(C\) opens the path \(A \rightarrow C \leftarrow Y\), which transmits a non-causal association between \(A\) and \(Y\).
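The following is a minimal numerical sketch of this collider-stratification bias, not a calculation from the source. It assumes the null (treatment has no effect on malformation), codes \(C = 0\) as the fetuses kept in the analysis, and uses made-up probabilities; `p_censor` and `risk_in_selected` are hypothetical helper names.

```python
# Exact calculation (no simulation) of Pr[Y=1 | A=a, C=0] when A does not
# affect Y but both A and Y affect selection C (A -> C <- Y).
# All probabilities below are hypothetical, chosen only for illustration.

P_Y = 0.1  # Pr[Y=1], identical under A=1 and A=0 (the causal null)


def p_censor(a, y):
    """Hypothetical Pr[C=1 | A=a, Y=y]: treatment and malformation both
    raise the chance of being excluded from the analysis."""
    return 0.1 + 0.3 * a + 0.4 * y


def risk_in_selected(a):
    """Pr[Y=1 | A=a, C=0] by Bayes' rule."""
    kept_y1 = P_Y * (1 - p_censor(a, 1))        # Pr[Y=1, C=0 | A=a]
    kept_y0 = (1 - P_Y) * (1 - p_censor(a, 0))  # Pr[Y=0, C=0 | A=a]
    return kept_y1 / (kept_y1 + kept_y0)


causal_rr = P_Y / P_Y  # Pr[Y^{a=1}=1] / Pr[Y^{a=0}=1] under the null
selected_rr = risk_in_selected(1) / risk_in_selected(0)

print(f"causal risk ratio:         {causal_rr:.3f}")
print(f"risk ratio among selected: {selected_rr:.3f}")
```

With these assumed numbers the risk ratio among the selected is about 0.61 even though the causal risk ratio is exactly 1, so the treatment would look protective purely because of the restriction.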
\(A\) : Treatment
\(Y\) : Fetal malformation
\(C\) : Live birth
\(S\): Parental grief
A descendant of a collider is as dangerous as the collider itself.
Including randomized trials
\(A\) : Antiretroviral treatment
\(Y\): Death
\(L\): Disease severity
\(U\): High level of immunosuppression
\(C\): Loss to follow-up
Remember, \(C\) is not a variable we put in a regression model. It’s a part of how your analyzed data was formed.
In this example, \(A\) can appear beneficial not because it actually reduces mortality, but because it causes sicker people to leave the study, even though in reality \(A\) has no effect on \(Y\).
The previous DAG is an example of selection bias due to differential loss-to-follow-up or informative censoring.
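As a rough illustration of informative censoring, the sketch below computes the risk ratio among the uncensored exactly. The arrows are assumptions, since the figure is not reproduced in the text: \(U \rightarrow L\), \(U \rightarrow Y\), \(L \rightarrow C\) and \(A \rightarrow C\), with \(A\) randomized and without any effect on \(Y\). All probabilities and function names are hypothetical.

```python
from itertools import product

# Hypothetical parameters: A is randomized and has NO effect on Y.
P_U = 0.3  # Pr[U=1], high level of immunosuppression


def p_l(u):
    """Pr[L=1 | U=u]: immunosuppression raises disease severity."""
    return 0.2 + 0.6 * u


def p_y(u):
    """Pr[Y=1 | U=u]: death depends on U only, not on A."""
    return 0.1 + 0.4 * u


def p_c(a, l):
    """Pr[C=1 | A=a, L=l]: treated, severely ill people drop out more."""
    return 0.1 + 0.6 * a * l


def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x."""
    return p if x == 1 else 1 - p


def risk_uncensored(a):
    """Pr[Y=1 | A=a, C=0], summing over the unmeasured U and over L."""
    num = den = 0.0
    for u, l in product((0, 1), repeat=2):
        weight = bern(P_U, u) * bern(p_l(u), l) * (1 - p_c(a, l))
        num += weight * p_y(u)
        den += weight
    return num / den


naive_rr = risk_uncensored(1) / risk_uncensored(0)
print(f"risk ratio among the uncensored: {naive_rr:.3f}")
```

Under these assumptions the ratio is about 0.80: the treated who remain in the study are healthier on average than the untreated who remain, so a null treatment appears to reduce mortality.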
These DAGs are modified versions of Figure 8.3
\(E\): Estrogen use
\(D\): CHD
\(F\): Hip fracture
\(C\): Selection into the study
We use selection bias to refer to all biases that arise from conditioning on a common effect of two variables, one of which is either the treatment or a cause of the treatment, and the other is either the outcome or a cause of the outcome.
Selection bias, similar to confounding bias, is a violation of the exchangeability assumption.
Differential loss to follow-up or informative censoring.
Missing data bias, or non-response bias.
Healthy worker bias.
Self-selection bias or volunteer bias.
Selection affected by treatment received before study entry (also known as prevalent-user bias).
Immortal-time bias is a mix of selection and misclassification bias.
All of them, even randomized experiments.
Randomization fixes confounding but not selection.
Selection bias is more likely to occur with designs that are built on selection by default, e.g., the case-control design.
Conventional covariate adjustment in treatment-confounder feedback setting.
Cox regression.
RCTs are conducted among volunteers willing to enter the experiment. So those volunteers select into the trial.
However, this is not what we mean here by selection bias.
Based on our definition, the selection variable should be a common effect of the treatment (or a cause of the treatment) and the outcome (or a cause of the outcome).
Since volunteering happens before treatment assignment, it cannot be affected by the treatment, so there is no selection bias in this sense.
The self-selection bias we mentioned earlier is about agreeing to continue in the trial after being treated.
\(A\): Physical activity.
\(Y\): Heart disease
\(C\) : Being a firefighter
\(L\): Parental socioeconomic status
\(U\): Attraction towards physical activity
It can guide the choice of the analytic method
It can help in study design and data collection.
Selection bias resulting from conditioning on pre-treatment variables (e.g., being a firefighter) could explain why certain variables behave as “confounders” in some studies but not others.
Causal diagrams enhance communication among investigators and may decrease the occurrence of misunderstandings.
\(A\): Treatment (protective)
\(Y_1 \ \text{and}\ Y_2\): Death at time 1 and time 2.
\(U\): Protective Haplotype
\[ aRR_{AY_1} = \frac{Pr[Y_1=1|A=1]}{ Pr[Y_1=1|A=0]} \]
\[ aRR_{AY_2} = \frac{Pr[Y_2=1|A=1]}{ Pr[Y_2=1|A=0]} \]
\[ HR_{AY_1} = aRR_{AY_1} = \frac{Pr[Y_1=1|A=1]}{ Pr[Y_1=1|A=0]} \]
\[ HR_{AY_2} = aRR_{AY_2|Y_1=0} = \frac{Pr[Y_2=1|A=1,Y_1=0]}{ Pr[Y_2=1|A=0, Y_1 = 0]} \]
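To see why the period-2 hazard ratio conditions on a collider, the sketch below evaluates both hazard ratios exactly under assumed numbers. It supposes that \(A\) and \(U\) are independent, that both lower the risk of death at time 1, and that \(A\) has no effect at all at time 2; these assumptions and all function names are hypothetical, not taken from the source.

```python
P_U = 0.5  # hypothetical Pr[U=1]; U is independent of A


def p_death1(a, u):
    """Hypothetical Pr[Y1=1 | A=a, U=u]: A and U are both protective."""
    return 0.6 - 0.3 * a - 0.2 * u


def p_death2(u):
    """Hypothetical Pr[Y2=1 | Y1=0, A, U=u]: no effect of A at time 2."""
    return 0.5 - 0.3 * u


U_DIST = ((0, 1 - P_U), (1, P_U))  # (value of U, probability)


def hazard1(a):
    """Pr[Y1=1 | A=a], averaging over U."""
    return sum(pu * p_death1(a, u) for u, pu in U_DIST)


def hazard2(a):
    """Pr[Y2=1 | A=a, Y1=0], averaging over U among period-1 survivors."""
    surviving = {u: pu * (1 - p_death1(a, u)) for u, pu in U_DIST}
    total = sum(surviving.values())
    return sum(surviving[u] / total * p_death2(u) for u in surviving)


hr1 = hazard1(1) / hazard1(0)
hr2 = hazard2(1) / hazard2(0)
print(f"HR at time 1: {hr1:.3f}")
print(f"HR at time 2: {hr2:.3f}")
```

With these numbers the first hazard ratio is 0.4, reflecting the protective effect, while the second is slightly above 1 even though \(A\) does nothing at time 2. Untreated survivors are more likely to carry the protective haplotype than treated survivors, so conditioning on \(Y_1 = 0\) opens the path \(A \rightarrow Y_1 \leftarrow U \rightarrow Y_2\).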
In conclusion, we have two issues:
The estimand changed.
Selection bias
\[\frac{Pr[Y^{ a=1,c=0} = 1]}{Pr[Y^{ a=0,c=0} = 1]} \]
This reads as the effect of \(A\) on \(Y\) had everyone received \(A\) and remained uncensored versus had no one received \(A\) and everyone remained uncensored.
Weighting can be a good approach to achieve this (See example).
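A minimal sketch of inverse probability of censoring weighting on simulated data follows. The data-generating model is entirely hypothetical (randomized \(A\) with no effect on \(Y\), censoring driven by \(A\) and a measured \(L\), and an unmeasured \(U\) causing both \(L\) and \(Y\)), and the weights \(1 / Pr[C = 0 | A, L]\) are estimated from stratum frequencies rather than from a regression model.

```python
import random
from collections import defaultdict

random.seed(0)
N = 400_000  # simulated individuals


def draw(p):
    """One Bernoulli(p) draw."""
    return 1 if random.random() < p else 0


# Hypothetical data-generating process; U is simulated but never used below.
rows = []
for _ in range(N):
    a = draw(0.5)                 # randomized treatment, no effect on Y
    u = draw(0.3)                 # unmeasured cause of L and Y
    l = draw(0.2 + 0.6 * u)       # measured disease severity
    y = draw(0.1 + 0.4 * u)       # outcome
    c = draw(0.1 + 0.6 * a * l)   # censoring depends on A and L
    rows.append((a, l, c, y))

# Step 1: estimate Pr[C=0 | A, L] within each (A, L) stratum.
total = defaultdict(int)
uncensored = defaultdict(int)
for a, l, c, _ in rows:
    total[(a, l)] += 1
    uncensored[(a, l)] += (c == 0)
p_uncensored = {key: uncensored[key] / total[key] for key in total}


def risk(a_val, weighted):
    """Risk of Y among the uncensored with A=a_val, optionally weighting
    each person by 1 / Pr[C=0 | A, L]. Y of censored rows is never read."""
    num = den = 0.0
    for a, l, c, y in rows:
        if a == a_val and c == 0:
            w = 1.0 / p_uncensored[(a, l)] if weighted else 1.0
            num += w * y
            den += w
    return num / den


naive_rr = risk(1, weighted=False) / risk(0, weighted=False)
ipw_rr = risk(1, weighted=True) / risk(0, weighted=True)
print(f"unweighted risk ratio: {naive_rr:.3f}")
print(f"IP-weighted risk ratio: {ipw_rr:.3f}")
```

In this setup the unweighted ratio should sit well below 1, while the weighted ratio should be close to the null value of 1. The weighting works here only because \(L\) blocks the path from \(C\) to \(Y\); if censoring also depended on the unmeasured \(U\) directly, weights based on \(A\) and \(L\) would not be enough.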